A primer on partially observable Markov decision processes (POMDPs)
Abstract
Changes in species population size, habitat quality, the presence or absence of threats, and climate are attributes that make ecological systems dynamic (Brown et al., 2001). In conservation and applied ecology, making an informed decision to manage a system would ideally be based on perfect knowledge of the state of the system, as one management action rarely fits all situations. State-transition models provide a way of representing these dynamics as a discrete set of states and transitions (Bestelmeyer et al., 2003, 2017). When looking for the optimal sequence of decisions to achieve an objective, stochastic approaches and their mathematical implementation, Markov decision processes (MDPs), are the go-to model for ecologists (Bellman, 1957; Marescot et al., 2013; Sigaud & Buffet, 2013). However, their application implicitly or explicitly assumes that the state of the system can be accurately identified. Unfortunately, ecological systems are difficult to monitor and our ability to identify their state varies widely (Nichols & Williams, 2006; Norouzzadeh et al., 2018). While multiple monitoring approaches are available to help, in practice many situations require managers to act without complete information about the system under study (Field et al., 2005). Such a situation may occur when there is no time to collect more data (urgent decision-making), when collecting additional data is costly or time-consuming, or when it is simply impossible (Chadès & Nicol, 2016a,b). When the state of the system is unknown or partially unknown, partially observable Markov decision processes (POMDPs) can guide the decision-making process. Introduced by Åström (1965), POMDPs generalise MDPs (Bellman, 1957) by incorporating the idea that decision-makers might not be able to perfectly observe the world state. Because such problems are common in Artificial Intelligence (AI) and Machine Learning (ML), for example in the design of smart autonomous robots and cars (Cassandra, 1998), the AI and ML scientific community has developed exact, approximate (with performance guarantee) and heuristic (without performance guarantee) algorithms to tackle the formidable computational problem POMDPs present (Kaelbling et al., 1998; Pineau et al., 2003). However, while that research focuses on optimally solving large problems efficiently (Russell & Norvig, 2002), ecology deals with human-operated systems in which the interpretation and explanation of results are perhaps more important than overall performance. Here, drawing on our experience designing POMDP applications, we present a primer on POMDPs for ecologists.

Partially observable Markov decision processes have been used in a range of applications. In conservation, they have been used to explore the dilemma of investing resources in either on-ground management or surveillance of cryptic threatened species (Chadès et al., 2008; Dujardin et al., 2015; McDonald-Madden et al., 2011), to manage threats and reintroductions of listed species (Nicol & Chadès, 2012), and to decide whether to survey nesting sites or allow human use of the coastal forest habitat of an endangered seabird (Tomberlin, 2010). In invasive species management, POMDPs have been used to decide how long to manage weeds with difficult-to-detect microscopic seeds (Regan et al., 2011) and to choose between preventing, searching for and destroying infestations when the severity of the infestation is only uncertainly known (Rout et al., 2014). In natural resource economics, POMDPs have also helped determine the influence of monitoring costs on reaching a target vegetation state (White, 2005) and the value of information when controlling an invasion (Fackler & Haight, 2014; Haight & Polasky, 2010). Recently, Memarzadeh et al. (2019) showed that POMDP-based management can avoid over-exploitation of fisheries while generating increased economic value. Accounting for space, POMDPs have been used to derive general management and surveillance priorities for small networks of meta-populations of threatened species and diseases, assuming Susceptible-Infected-Susceptible (SIS) dynamics (Chadès et al., 2011).
More recently, POMDPs were shown to be useful for solving adaptive management problems (Walters, 1986; Williams, 2011). Indeed, uncertainty about how a system responds to management can be reduced while optimising management programmes (Chadès et al., 2012). Applications include protecting migratory shorebirds under the uncertain consequences of sea-level rise and non-stationary dynamics (Nicol et al., 2013, 2015), adaptively managing ecological systems under partial observability (Memarzadeh & Boettiger, 2018) and informing recreation management in Denali National Park to simultaneously preserve an abundant eagle population and maintain hiker access. Despite a growing literature on the topic, guidance on how to define and solve POMDPs remains scarce (but see attempts in operations research (Lovejoy, 1991b; White, 1991), management science (Monahan, 1982), artificial intelligence (Cassandra, 1998) and psychology (Littman, 2009)). This primer fills this gap for ecological readers. We first introduce POMDPs and outline when they are useful. We then formally define POMDPs and present a typology of the problems they can solve. We explain some of the underlying theory, what makes POMDPs computationally challenging and ways of solving them using a selected number of toolboxes. A GitHub repository (https://github.com/conservation-decisions/POMDPproblems) provides a one-stop shop of example problems. We discuss the need to understand POMDP solutions to foster their uptake in ecology. Finally, we reflect on 10 years of applying POMDPs and on future directions. The Supporting Information provides much-needed guidance on the steps involved in installing and running solvers (in R, C/C++ and MATLAB).

To introduce MDP models, we use the management of a Sumatran tiger population as an illustrative example (Pascal et al., 2020). At each time step, a manager must decide how to allocate limited resources to keep the population extant. The state of the system is defined as locally extinct or extant and represents the status of the local population at a given time step. In this completely observable case, we assume the manager knows the true state of the population and can choose between two actions: invest in management or do nothing. The transition probabilities represent the dynamics of the system, that is, the probability of the population becoming extinct or remaining extant at the next time step following the implementation of an action. Under 'do nothing', the probability of local extinction increases, and once extinct no recovery is possible; the transition matrices for both actions are parameterised with data from Chadès et al. (2008). In the case of the tiger, we assumed that an extant population attracts funding, while implementing management comes at a cost; the value of implementing each action in each state is the corresponding difference, which defines the reward matrix.

Figure 1a provides a compact graphical representation of this model as an influence diagram. An influence diagram structures the modelling variables (e.g. states, actions, rewards), revealing the probabilistic dependence of the transitions (T) and the flow of information (arrows; Shachter, 1986). The time horizon and the optimisation criterion matter, and each presents its own challenges. Finite-horizon problems require accounting for time as an extra dimension, which makes them more complex; in addition, the most efficient algorithms are designed for the infinite-horizon case (Kaelbling et al., 1998). For the sake of simplicity, we maximise the expected sum of discounted rewards over an infinite horizon; Marescot et al. (2013) provide a step-by-step guide for the finite-horizon case. We solved this MDP using a discount factor of 0.95. The resulting value function and policy mean that while the population is locally extant, we should keep managing; once locally extinct, it is best to stop managing and do nothing.

To account for imperfect knowledge of the state, we augment the MDP with observations and obtain a POMDP (Åström, 1965), a convenient model for sequential optimisation when the decision-maker does not know the current state of the system. Together, (a) state-transition models, (b) MDPs and (c) POMDPs form a gradient from fully to partially observable models. MDPs are conceptually and computationally easier and have a much longer history of use in ecology (Clark & Mangel, 2000). In the language of statistics, POMDPs describe the control of hidden Markov models (HMMs): the true state is a latent variable that must be inferred from the (fully observable) observation states. Table 1 identifies the differences between these models, differentiating in particular whether the manager has modelled imperfect detection of the species.

Due to low numbers and its cryptic nature, evaluating the status of the tiger population is challenging. To account for imperfect detection, the manager could rely on incidental sightings, switch to dedicated surveillance, or ultimately surrender management. As in the MDP setting, the states are locally extinct and extant, but we now add observations: present and absent, which denote whether tigers were detected. Implementing a survey now provides the opportunity to deploy camera traps and increase the probability of detection; a combined 'manage and survey' action is not included here, but see McDonald-Madden et al. (2011). Management functions similarly to the MDP case and has the same effectiveness; however, the probability of detecting the population when it is extant is 0.01 under management alone and rises to 0.78 if surveying is implemented, and false positive observations do not occur. When defining a POMDP problem, we recommend clarifying the fully observable (MDP) elements first, before adding the observation components (Figure 1b); defining the MDP first gives a better understanding of the mechanistic insights that the observation model adds.

We identified three types of POMDP problems. The first is the classic type, where reducing state uncertainty, usually through monitoring, must be traded off against exploitation (Type 1). The second involves systems whose states and management options vary across space (Type 2). The third family corresponds to adaptive management problems (Type 3). The tiger example illustrates the trade-off between exploration (reduce uncertainty) and exploitation (manage) that most applications address: as previously discussed, every year managers must choose between activities that abate threats (manage), monitoring the species' status (survey) or, alternatively, doing nothing (do nothing).
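The fully observable tiger MDP described above can be written down and solved in a few lines of code. The base R sketch below is a minimal illustration, not the paper's code: the states (locally extinct, extant), actions (do nothing, manage) and discount factor (0.95) come from the text, while the transition probabilities and rewards are illustrative placeholders rather than the values of Chadès et al. (2008).

```r
# Value iteration for the fully observable tiger MDP.
# States: extinct, extant. Actions: do nothing ("nothing"), manage ("manage").
# Transition probabilities and rewards are illustrative placeholders only;
# the discount factor (0.95) is the one used in the primer.

states  <- c("extinct", "extant")
actions <- c("nothing", "manage")

# trans[[a]][s, s'] = P(s' | s, a); local extinction is absorbing.
trans <- list(
  nothing = matrix(c(1.00, 0.00,
                     0.20, 0.80),
                   nrow = 2, byrow = TRUE, dimnames = list(states, states)),
  manage  = matrix(c(1.00, 0.00,
                     0.05, 0.95),
                   nrow = 2, byrow = TRUE, dimnames = list(states, states))
)

# reward[s, a]: an extant population attracts funding, managing has a cost.
reward <- matrix(c( 0, -10,    # extinct: no funding; managing only costs
                   50,  40),   # extant : funding; funding minus management cost
                 nrow = 2, byrow = TRUE, dimnames = list(states, actions))

gamma <- 0.95
V <- setNames(c(0, 0), states)

for (i in 1:1000) {
  # Q[s, a] = reward[s, a] + gamma * sum over s' of trans[s, s' | a] * V[s']
  Q <- sapply(actions, function(a) reward[, a] + gamma * as.vector(trans[[a]] %*% V))
  V_new <- apply(Q, 1, max)
  if (max(abs(V_new - V)) < 1e-8) break
  V <- V_new
}

policy <- setNames(actions[apply(Q, 1, which.max)], states)
print(V)       # state values
print(policy)  # with these placeholders: manage while extant, do nothing once extinct
```

With the published parameters in place of the placeholders, the same loop would return the policy described in the text: keep managing while the population is extant and stop once it is locally extinct.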
Because the site hosts a relatively charismatic species (the tiger), visitors to the area pay a fee to visit the park, which generates a source of income each year. Finding the strategy that maximises the revenue of the park over the long term (and therefore the persistence of the population) has been explored for a single population (SumatranTiger.pomdp in the repository), for two populations (McDonald-Madden et al., 2011; Tiger2pop.pomdp) and generalised to networks of threatened species, as well as diseases, structured as metapopulations (Chadès et al., 2011; tiger-metapopT10.pomdp). A key insight from these applications is that investment in reducing uncertainty is only worthwhile when it is balanced against investment in abating threats. A webapp (https://conservation-decisions.shinyapps.io/smsPOMDP/), developed with the Saving our Species program in New South Wales (Australia), supports decisions about surveying and managing cryptic threatened species (Pascal et al., 2020).

A second type of problem arises when uncertainty cannot be resolved by a specific surveillance action; instead, detection always occurs at some background rate (Regan et al., 2011). Regan et al. (2011) modelled the eradication of an invasive plant whose presence is uncertain, choosing each year between fumigation, a less effective host-denial crop, and doing nothing (a pasture crop). The authors found that the best strategy depended on the risk of colonisation from outside the managed area (weeds.pomdp in the repository). For Type 2 problems, where no action can fully resolve uncertainty, it is particularly important to ensure that the POMDP formulation is worthwhile: this can be done by checking that the solution differs from those of simpler alternatives, because modelling partial observability only matters when it significantly changes the recommended approach. When these conditions are not met, we recommend against a POMDP, because MDP solutions are much easier to interpret.

The third type corresponds to adaptive management problems, in which the state space is augmented with a hidden variable or parameter representing model uncertainty. This formulation is often called a Mixed Observability MDP (MOMDP; Ong et al., 2010) because it includes both fully and partially observable state variables (see Supporting Information for details). Chadès et al. (2012) illustrate how to adaptively manage a threatened bird, the Gouldian finch. Pervasive threats to wild finch populations include habitat loss and degradation caused by inappropriate fire and grazing regimes, and introduced predators such as feral cats. Because the response of the population to management is uncertain, each of four experts provided a model (probability distributions) describing how the population would respond to alternative threat management actions. The objective is to keep the population high; in this context, the partially observable component is which expert model is 'true', and beliefs about it are updated as observations are made (see the simulation example). Solving this problem identifies the actions that maximise the population while learning which expert model is most accurate (Chadès et al., 2012; gouldian4exp.txt). Other examples include species protection (Nicol et al., 2015) and resource assessment (Memarzadeh et al., 2019). Readers interested in studies of adaptive management with POMDPs can refer to Fackler and Pacifici (2014) and Boettiger (2018).

A consequence of including observations in the problem definition is that basing decisions solely on the most recent single observation leads to poor decision-making. Rather, the best decision depends on past actions and observations. Optimisation would then require us to search over state-action-observation histories to select a policy; however, the number of histories grows exponentially with time and quickly becomes impossible to store explicitly. Because it is neither practical nor tractable to work with every action-observation trajectory when computing a solution, we instead track a belief state that summarises the history and overcomes the difficulties caused by imperfect detection. A belief state is a probability distribution over the states: intuitively, it captures how likely the system is to be in each state at a given time. Given no other information about the state, an equal probability can be assigned to each state; in the Gouldian finch example, the initial belief weights each of the four expert models by 0.25. Solving a POMDP then amounts to finding a policy that maps belief states to actions (the allocation of resources). Similar to the infinite discounted-horizon MDP formulation, the solution consists of a value function that maximises the immediate reward (first part of the equation) plus the expected discounted future value (second part of the equation). The belief update equation is easy to calculate knowing the transition and observation probabilities. Interested readers can refer to Williams (2011) to see how to apply the dynamic programming algorithm of value iteration to POMDPs.

The main challenge preventing a direct application of value iteration is the continuous (belief) state space: in standard value iteration, we enumerate all states to compute the next value function, but a belief state can take an infinite number of values because it is continuous. For this reason, research has focused on proposing exact and approximate solution methods (Section 6). Most solvers return the solution in one of two forms: a set of alpha-vectors or a policy graph that can be used directly. To interpret these solutions, the critical concept is that the optimal value function of a POMDP is piecewise linear and convex: it can be approximated arbitrarily closely by the upper envelope of a finite set of linear functions, the alpha-vectors (Smallwood & Sondik, 1973; Sondik, 1978). Once calculated, the alpha-vectors can be plotted and directly queried for any belief. For the tiger problem, the solution obtained contains 13 alpha-vectors (APPL/SARSOP solver). To implement the policy, suppose we believe the population is extant with probability 0.8 and extinct with probability 0.2; evaluating the alpha-vectors at this belief (Equation 4) identifies the maximising vector (here, alpha-vector 9) and its associated node in the policy graph. Policy graphs can be derived from the text files encoding the solution (Cassandra, 2015). They are directed graphs with nodes (one node per alpha-vector) and edges, where each edge indicates the node to move to after an observation (Figure 4). Policy graphs can conveniently take a simple form (Figure 5a) that can be interpreted by humans, or be simplified further (Figure 5b). In summary, alpha-vectors and policy graphs are two representations of the same solution, and the total number of alpha-vectors is a good indicator of the difficulty of interpreting a policy (Figures 4 and 5).

Although important for uptake, visualising POMDP solutions is a difficult endeavour. Low-dimensional problems (2 or 3 states) can be represented graphically, but this quickly becomes impossible as problems grow, and policy graphs rapidly become too dense to visualise and interpret. Of interest, the published policies of Chadès et al. (2008) could be summarised as simple rules (manage, survey, surrender) linked by thresholds of the form 'not seen for x years'; this exploited 'time since last detection' as an explanatory variable, whereas applying the same idea to the larger problems of Nicol and colleagues produced 25 such occurrences and was correspondingly harder to interpret.
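Two pieces of machinery described above, the belief state and the alpha-vector representation of the value function, are short to implement. The base R sketch below is a minimal illustration, not the paper's code: the example belief (0.8 extant, 0.2 extinct) and the detection probabilities (0.01 while managing, 0.78 while surveying) come from the text, whereas the transition matrix and the three alpha-vectors are placeholders standing in for the 13 vectors the APPL/SARSOP solver returns for this problem.

```r
# Belief update (Bayes' rule over the hidden Markov model) and greedy action
# selection from a set of alpha-vectors, for the two-state tiger POMDP.
# States: extinct, extant. Observations: "absent", "present".

states <- c("extinct", "extant")

# b'(s') is proportional to O(z | s', a) * sum over s of T(s' | s, a) * b(s)
update_belief <- function(b, trans, p_detect, z) {
  predicted  <- as.vector(t(trans) %*% b)                   # sum_s T(s'|s,a) b(s)
  likelihood <- if (z == "present") p_detect else 1 - p_detect
  unnorm     <- likelihood * predicted
  unnorm / sum(unnorm)
}

# Placeholder transition matrix under the manage-and-survey action.
trans_ms <- matrix(c(1.00, 0.00,
                     0.05, 0.95),
                   nrow = 2, byrow = TRUE, dimnames = list(states, states))

# Detection probabilities from the text: no false positives,
# P(present | extant) = 0.78 when surveying (0.01 when only managing).
p_detect_survey <- c(extinct = 0, extant = 0.78)

b <- c(extinct = 0.2, extant = 0.8)   # the belief used in the worked example
b_new <- update_belief(b, trans_ms, p_detect_survey, z = "absent")
print(round(b_new, 3))   # failing to detect the tiger shifts belief towards extinction

# The value function is the upper envelope of alpha-vectors:
# V(b) = max_k sum_s alpha[k, s] * b(s); the maximising vector gives the action.
alpha <- rbind(`do nothing` = c( 0,  5),   # placeholder values, one row per vector
               manage       = c(-2, 40),
               survey       = c(-4, 45))

values <- as.vector(alpha %*% b)
best   <- which.max(values)
cat("V(b) =", values[best], "-> action:", rownames(alpha)[best], "\n")
```

The two outputs correspond to the two objects most solvers work with: a belief trajectory that is maintained online as actions are taken and observations made, and the alpha-vectors (or the equivalent policy graph) computed offline.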
When problems become large and hard to interpret, studying a reduced set of plausible scenarios (scenario analysis) or deriving a factored representation of the problem can help (Hoey et al., 1999). Essentially, a factored representation takes advantage of conditional independence between variables, akin to a Bayesian network representation. Factored formulations result in decision trees or networks that can reduce computation when problems are large.

Perhaps the simplest way to approximate a POMDP solution is to discretise the belief space, dividing each belief variable into p subintervals. Because the belief updating rule does not guarantee that an updated belief falls exactly on the grid points, interpolation between points is required once the belief space is discretised (Figure 3). This technique has been studied as a way of sidestepping the intractability of exact methods, using fixed grids (Fackler, 2011; Lovejoy, 1991a) or variable grids (Zhou & Hansen, 2001). Grid-based methods differ mainly in what shape the grid takes: in general, regular grids do not scale well with the dimensionality of the problem, while non-regular grids suffer from expensive interpolation routines. One source of inefficiency is that many grid points correspond to beliefs that are never visited during implementation of a policy, so there is little value in computing the solution at such a point. Hence, point-based methods restrict computation to beliefs reachable from the initial belief (Kurniawati et al., 2008; Pineau et al., 2003; Shani et al., 2013), and they underpin most modern approximate solvers. Explaining how these algorithms are implemented is beyond the scope of this manuscript, and we invite interested readers to consult the excellent technical reviews already available. Our Supporting Information explains how to install and run the solvers.

Historically one of the first toolboxes available, Cassandra's pomdp-solve proposes five algorithms. For users, pomdp-solve solves problems with or without discounting and offers several stopping criteria and options. Most notably, the toolbox takes its input in the '.pomdp' format, known as 'Tony's format', and most other solvers provide a parser that reads Tony's format. Output files are of two types: the alpha vectors, with an '.alpha' extension, and the policy graph, with a '.pg' extension; both can be opened with text editors. pomdp-solve is written in C and benefits from a recent R wrapper (Table 2), which suits small to medium problem sizes. MDPSolve (Fackler, 2011) is a MATLAB toolbox for dynamic programming. It was created to solve a variety of natural resource management problems (Fackler & Pacifici, 2014) and applies grid-based algorithms: the belief space is discretised and interpolated over rectangular or simplex grids (Lovejoy, 1991a; Zhou & Hansen, 2001), and the problem is then solved as an MDP. MDPSolve comes with extended documentation and a user's guide.

Point-based solvers (Pineau et al., 2003; Spaan & Vlassis, 2005) typically sample belief points by simulating interactions with the system and restrict value updates to this sampled selection. Perseus performs an iterative improvement of the value function, ensuring at each stage that the value of all sampled points is improved or at least not degraded; in contrast to other point-based methods, it updates only a (randomly selected) subset of the sampled set that is sufficient for improving the value of all points (Spaan & Vlassis, 2005). Perseus is written in MATLAB. The tool Symbolic Perseus (Poupart, 2005) uses mechanisms similar to Perseus together with factored representations. APPL (Approximate POMDP Planning toolkit) implements SARSOP (Successive Approximations of the Reachable Space under Optimal Policies), a solver that samples beliefs reachable under near-optimal policies (Kurniawati et al., 2008). It is coded in C++ and recognised for its efficiency (SARSOP won the ICAPS 2011 planning competition and is used by major research groups), and an R wrapper (Boettiger et al., 2020) makes it an attractive option for those familiar with that environment. MO-SARSOP, a version of the solver that allows factoring the state space into fully and partially observable variables, solves MOMDPs (Supporting Information); Péron and colleagues show experimental comparisons with standard POMDPs.

Two questions should be asked before formulating a POMDP: does the uncertainty matter, and is it really a sequential decision problem? Formulating a POMDP is time-consuming and creates the additional challenge of developing interpretable solutions. To give a POMDP analysis the best chance of success, we share lessons learned over the years. Building the fully observable (MDP) version of the problem first is very valuable, especially when little is known about the system, because it provides a clear, simpler version of the problem. Insights gained include whether one action dominates (in which case a POMDP may not be needed) and whether the state variables are appropriate and appropriately discretised. We also recommend reducing the problem to the smallest ensemble of states that can be manageably represented and explained; gauging how far to go is an art that improves with experience. The question of how to develop a minimum yet accurate model has inspired work in AI: one line of work asked 'which variables matter?', while the automated online approach devised by Uther and Veloso (1998) automatically splits the state space according to the quality of the resulting policy. Ferrer-Mestres et al. (2020) took the opposite approach, starting from a full model and aggregating states into smaller models while minimising the loss of performance. Assessing the performance of a solution is essential to demonstrate its value; good practice includes comparing the POMDP policy against simpler rules of thumb.

It is also worth discussing the less obvious differences between the recommendations a POMDP and a value of information (VoI) analysis provide. VoI evaluates the expected gain from a data collection exercise and, as such, informs the cost-effectiveness of monitoring projects (Pratt et al., 1995; Canessa et al., 2015; Runge et al., 2011). VoI gains are usually computed assuming a two-stage process: collect information first, then manage; resources spent while learning are unaccounted for. In comparison, a POMDP can represent any sequential decision process and is, in that sense, more flexible and powerful than one-off expected value of perfect or sample information (EVPI/EVSI) calculations, although VoI calculations are well suited to fast prototyping.

Several extensions of POMDPs have been proposed. For advanced readers, rho-POMDPs (Araya et al., 2010) place rewards on beliefs, rather than on states, so that information-oriented objectives can be optimised directly (Fehr et al., 2018, proposed solution algorithms). Another extension incorporates causality (xPOMDPs), motivated by the difference between structural and observational information (Baggio et al., 2016); this line of work investigated the timing and informativeness of strategies not discussed in this manuscript.
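Since most of the toolboxes above read Cassandra's '.pomdp' input ('Tony's format'), it may help to see the shape of such a file. The sketch below follows the documented layout of that format (a preamble, then T:, O: and R: entries); the state, action and observation names match the tiger example, but every probability and reward is a placeholder rather than the parameterisation used in SumatranTiger.pomdp, and only the two detection probabilities (0.78 and 0.01) echo values mentioned in the text.

```
# A two-state problem sketched in Tony's '.pomdp' format (pomdp-solve input,
# also parsed by APPL/SARSOP). All numbers are illustrative placeholders.

discount: 0.95
values: reward
states: extinct extant
actions: do-nothing manage survey
observations: absent present

# T: <action> : <start-state> : <end-state>  probability
T: do-nothing : extinct : extinct   1.00
T: do-nothing : extant  : extinct   0.20
T: do-nothing : extant  : extant    0.80
T: manage     : extinct : extinct   1.00
T: manage     : extant  : extinct   0.05
T: manage     : extant  : extant    0.95
T: survey     : extinct : extinct   1.00
T: survey     : extant  : extinct   0.20
T: survey     : extant  : extant    0.80

# O: <action> : <end-state> : <observation>  probability (no false positives)
O: survey     : extant  : present   0.78
O: survey     : extant  : absent    0.22
O: survey     : extinct : absent    1.00
O: manage     : extant  : present   0.01
O: manage     : extant  : absent    0.99
O: manage     : extinct : absent    1.00
O: do-nothing : extant  : absent    1.00
O: do-nothing : extinct : absent    1.00

# R: <action> : <start-state> : <end-state> : <observation>  reward
R: do-nothing : extant  : * : *     50
R: manage     : extant  : * : *     40
R: survey     : extant  : * : *     35
R: *          : extinct : * : *      0
```

A typical invocation is something like `pomdp-solve -pomdp file.pomdp`, which produces the '.alpha' and '.pg' output files described above.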
Several other approaches were not covered here. These include density projection (Kling et al., 2017), reinforcement learning, discretisation and point-based value iteration for continuous state spaces (Porta et al., 2006, and later work by Brechtel and colleagues), online sampling-based approximation (by Sunberg and Kochenderfer, among others) and deep learning approaches (Igl et al., 2018, and work by Karkus and colleagues). We hope this primer provides an entry point into that literature. Automatically building small, understandable models is an emerging topic in explainable artificial intelligence (XAI). Traditionally, POMDP research has pursued ever larger state spaces, yet policies over thousands of states are, in practice, hard to understand. With the growing demand for interpretable models (Rudin, 2019; Rudin & Radin, 2019), POMDP problems may come to serve as benchmarks in the XAI literature. Efforts already exist to simplify solutions by pruning (Dujardin et al., 2015, 2021), but much more needs to be done. To facilitate work in this domain, we will continue to collate problems in the repository and invite readers to contribute ideas, including those tried in the field without success.

Acknowledgements. We thank T.J. Regan, C. Hauser, Y.M. Buckley and T.G. Martin for their support, Gwen Iacona for providing feedback on an earlier version, and Paul and an anonymous reviewer for their thoughtful comments. J.F.-M. was supported by a CSIRO Research Office Postdoctoral Fellowship, I.C. by the MLAI Future Science Platform (Activity Decisions) and S.N. by a Julius Career Award. The authors declare no conflict of interest. Author contributions: the research and the original manuscript were developed with the support of all authors; L.V.P. ran and performed the simulations; all authors edited and improved the manuscript. The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.13692. Data and example problems are archived on Zenodo at https://doi.org/10.5281/zenodo.5234598 (Chades et al., 2021). Please note: the publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.
Similar articles
Partially observable Markov decision processes
For reinforcement learning in environments in which an agent has access to a reliable state signal, methods based on the Markov decision process (MDP) have had many successes. In many problem domains, however, an agent suffers from limited sensing capabilities that preclude it from recovering a Markovian state signal from its perceptions. Extending the MDP framework, partially observable Markov...
Bounded-Parameter Partially Observable Markov Decision Processes
The POMDP is considered as a powerful model for planning under uncertainty. However, it is usually impractical to employ a POMDP with exact parameters to model precisely the real-life situations, due to various reasons such as limited data for learning the model, etc. In this paper, assuming that the parameters of POMDPs are imprecise but bounded, we formulate the framework of bounded-parameter...
Quantum partially observable Markov decision processes
We present quantum observable Markov decision processes (QOMDPs), the quantum analogs of partially observable Marko...
Inducing Partially Observable Markov Decision Processes
In the field of reinforcement learning (Sutton and Barto, 1998; Kaelbling et al., 1996), agents interact with an environment to learn how to act to maximize reward. Two different kinds of environment models dominate the literature—Markov Decision Processes (Puterman, 1994; Littman et al., 1995), or MDPs, and POMDPs, their Partially Observable counterpart (White, 1991; Kaelbling et al., 1998). B...
Quasi-Deterministic Partially Observable Markov Decision Processes
We study a subclass of POMDPs, called quasi-deterministic POMDPs (QDET-POMDPs), characterized by deterministic actions and stochastic observations. While this framework does not model the same general problems as POMDPs, they still capture a number of interesting and challenging problems and, in some cases, have interesting properties. By studying the observability available in this subclass, w...
Journal
Journal title: Methods in Ecology and Evolution
Year: 2021
ISSN: 2041-210X
DOI: https://doi.org/10.1111/2041-210x.13692